Evidence-centered design (ECD) is a comprehensive framework for describing the 
conceptual, computational and inferential elements of educational assessment. It 
emphasizes the importance of articulating inferences one wants to make and the evidence 
needed to support those inferences. At first blush, ECD and educational data mining 
(EDM) might seem in conflict: structuring situations to evoke particular kinds of 
evidence, versus discovering meaningful patterns in available data. However, a dialectic 
between the two stances increases understanding and improves practice. We first 
introduce ECD and relate its elements to the broad range of digital inputs relevant to 
modern assessment. We then discuss the relation between EDM and psychometric 
activities in educational assessment. We illustrate points with examples from the Cisco 
Networking Academy, a global program in which information technology is taught 
through a blended program of face-to-face classroom instruction, an online curriculum, 
and online assessments. 

1. INTRODUCTION 

Data mining is the process of extracting patterns from large data sets, for 
purposes that include systems enhancement and scientific discovery [Witten and 
Frank 1999]. Educational data mining (EDM) in particular aims to provide 
insights into instructional practices and student learning, often using data from 
assessments and learning experiences, both formal and informal [Romero et al. 
2011]. Applying exploratory methods to existing data seems to contrast with 
the forward-design process of developing assessments. 

This paper explores the productive dialectic that can be developed between 
EDM and principled assessment design as seen from the perspective of evidence- 
centered design (ECD) [Almond et al. 2002; Mislevy et al. 2003]. We first 
present an overview of ECD, with an eye toward complex assessments. We then 
discuss the relationship between psychometric and EDM activities in assessment, 
and use the ECD perspective to highlight productive connections between EDM 
and assessment. 

Points are illustrated with brief examples from the literature and from our own 
work with the Cisco Networking Academy (CNA) 
[www.cisco.com/web/learning/netacad/index.html; see also Rupp et al. this 
issue]. The CNA is a global program in which beginning computer network 
engineering and ICT literacy is taught through a blended program of face-to-face 
classroom instruction, an online curriculum, and online assessments. Courses are 
delivered at high schools, 2- and 3-year community college and technical 
schools, and 4-year colleges and universities. Since its inception in 1997, the 
CNA has grown to reach a diverse population of about a million students each 
year in more than 165 countries [Murnane et al. 2002; Levy and Murnane 2004]. 
Behrens et al. [2005] discuss the framework that drives the ongoing assessment 
activity from which our illustrations are drawn. 

2. ASSESSMENT, ECD, AND PSYCHOMETRICS 

Most familiar applications of educational assessment are framed in what we will 
call the standard assessment paradigm. Data from each student are sparse, 
typically discrete responses to perhaps 30 to 80 test items. The items are 
predefined. The target of inference is a student’s level of proficiency in a domain 
framed in trait or behaviorist psychology and defined operationally by the items. 
Learning during the course of assessment is assumed to be negligible. We can 
view the standard assessment paradigm as a subspace of assessment viewed more 
broadly, where any or all of the familiar constraints could be relaxed: continuous 
performances in interactive environments, for example; richer data that 
encompass many aspects of activity at any level of detail; interest in multiple 
aspects of proficiency, evoked in different combinations in different situations; 
learning may occur, and may indeed be an aim of the experience. 

By definition, psychometrics is the measurement of educational and psychological 
constructs. Psychometrics in educational testing has focused mainly on data 
produced in the standard assessment paradigm. Much progress in test theory has 
been made “by treating the study of the relationship between responses to a set of 
test items and a hypothesized trait (or traits) of an individual as a problem of 
statistical inference” [Lewis 1986, p. 11]. Probabilistic test theory models allow 
an analyst to characterize the informational value of data about students in a 
probabilistic framework, and to use data from different tasks to draw inferences 
in terms of the same proficiencies. These are powerful inferential tools for 
practical work in assessment. 

The challenge for educational assessment is to jointly harness EDM 
capabilities to deal with the richer data environment in which we can now carry 
out assessment, and the inferential strengths of psychometric methods that have 
evolved for inference with data from the standard assessment paradigm. 

The way forward is an assessment framework that encompasses both 
perspectives, and supports the design and analysis of both familiar assessments 
and new ones that take advantage of technological advances to move beyond the 
standard assessment paradigm. Such a framework would embrace concepts and 
methods from EDM as well as from existing psychometrics. Recent work in 
assessment provides a suitable foundation. One line of progress is the conception 
of assessment as argument [Cronbach 1980, 1988; Kane 1992, 2006; Messick 
1989, 1994]. Another is the integration of psychometric modeling, assessment 
design, and cognitive theory [Embretson 1985; Pellegrino et al. 2001; Tatsuoka 
1983]. ECD builds on this research to provide a framework for analyzing and 
carrying out assessment design and implementation. We will use it to bring out 
productive and natural roles of EDM in the assessment enterprise. 

3. EVIDENCE-CENTERED DESIGN 

Assessment in the standard assessment paradigm is usually thought of only in terms of 
highly scripted circumstances in which students solve constrained tasks, typically by 
answering verbal questions, and of results that simply accumulate independent item 
scores. The ECD framework neither requires nor implies this limited view of 
assessment. It is flexible enough to describe a wide range of activities and goals 
associated with assessment as conceived more broadly, including familiar tests 
but accommodating the informal assessment activities of instructors interacting 
with students in the classroom, students working through open-ended simulation 
tasks [Frezzo et al. 2009; Mislevy 2011; Williamson et al. 2004], multi-student 
interactions in role playing or simulated situations [Shute 2011], and game-based 
assessments [Behrens et al. 2007; Mislevy et al. in press]. 

ECD emphasizes the specification of the logic of assessment, or the 
evidentiary argument [Embretson 1983]. Messick [1994] describes the structure 
of an assessment argument as follows: 

A construct-centered approach would begin by asking what complex of 
knowledge, skills, or other attributes should be assessed, presumably 
because they are tied to explicit or implicit objectives of instruction or 
are otherwise valued by society. Next, what behaviors or performances 
should reveal those constructs, and what tasks or situations should elicit 
those behaviors? Thus, the nature of the construct guides the selection or 
construction of relevant tasks as well as the rational development of 
construct-based scoring criteria and rubrics. (p. 16) 

ECD formalizes this structure with an explicit framework for designing and 
implementing assessments. The following section sketches the key concepts and 
representations. At a given point in time, for some practical assessment purpose, 
one can use ECD to design the components of an operational assessment. Design 
is practical, and it is provisional as well. Especially with more complex forms of 
assessment, we expect our understanding of the nature of proficiency to improve 
as we explore patterns in data from a given form of the assessment [Behrens et al. 
2012]. Bringing EDM tools to bear on the data at any given point in time can 
thus lead to deeper understanding and improvements for assessment design and 
analysis in the next version of the assessment. 

3.1 Assessment Components and ECD 

Figure 1 distinguishes five ECD “layers” at which different types of thinking and 
activity occur in the development and operation of assessment systems [Mislevy 
and Riconscente 2005, 2006]. 




[Figure 1 panels shown include: 
Conceptual Assessment Framework: design structures (student, evidence, and task models); generativity. 
Assessment Implementation: manufacturing “nuts & bolts” (authoring tasks, automated scoring details, statistical models); reusability. 
Assessment Delivery: students interact with tasks, performances evaluated, feedback created; four-process delivery architecture.] 


Fig. 1. Layers in the evidence-centered assessment design framework. 

Each layer contains components, processes and representations that are 
appropriate for the kinds of activities that take place in that layer (see Mislevy et 
al. 2010 on the central role of representations in ECD). Formal object models 
[Rumbaugh et al. 1991] have been implemented in the Portal design system 
[Almond et al. 2003; Steinberg et al. 2005] and the Principled Assessment Design 
for Inquiry (PADI) design system [Riconscente et al. 2005]. Although the figure 
might suggest a linear design and implementation process, iterative feedback 
loops are essential to successful designs. In the second part of the paper we will 
discuss critical roles that EDM can play in iterative design in assessment. 

Table 1 summarizes how the assessment layers play out in classroom 
instruction, standardized accountability assessment, and interactive diagnostic 
computer systems. The terminology for the layers and their components that 
appears in the table will be developed as we describe the layers in turn. 


Table I. Summary of Layers of the ECD Framework and Conceptualization of 
Activity from the ECD Perspective for Three Kinds of Assessment 

Columns: ECD Layer; Epistemic Focus; Classroom Instruction; Standardized 
Accountability Measure; Tutoring System. 

  Classroom Instruction: Attend to specific strengths and errors while 
    managing administrative requirements. 
  Standardized Accountability Measure: Broad inference from wide sample of 
    performance. 
  Tutoring System: Inference regarding specific functional states of students’ 
    knowledge and skill and providing experiences to improve them. 

ECD Layer: Domain Analysis 
  Epistemic Focus: Understand proficiencies, conditions of use, practices, 
    representations, standards, activities, etc. in the targeted domain. 
  Classroom Instruction: Teacher’s background studies of learning and of the 
    curricular goals. 
  Standardized Accountability Measure: Common texts associated with 
    curriculum; ongoing scientific activity feeding in to standards and 
    practices. 
  Tutoring System: Cognitive task analysis; protocol analysis; literature 
    related to domain. 

ECD Layer: Domain Modeling 
  Epistemic Focus: What are relationships among proficiencies, task 
    situations, and performance in such situations; what is important for the 
    purpose(s) of the assessment? 
  Classroom Instruction: Teacher’s mental model of curriculum elements & 
    dependencies, and situations for learning about students’ proficiencies. 
  Standardized Accountability Measure: Standards documents and assessment 
    frameworks; e.g., Common Core State Standards, National Science Education 
    Standards. 
  Tutoring System: Specifications of production rules and their combinations. 
    Relationship of performances & products to production rules. 

ECD Layer: Conceptual Assessment Framework (CAF) 
  Epistemic Focus: What are the linkages between tasks & evidence about 
    proficiency; i.e., what are schemas for tasks, procedures to capture and 
    evaluate performance? 
  Classroom Instruction: 
    SM: Aspects of student performances worth tracking. 
    TMs: Informal catalog of kinds of tasks for administrative ease as well as 
      verisimilitude to natural tasks (e.g., writing). 
    EM: Theory of what good performance looks like, to be applied on-the-fly. 
    MM: Add up individual items and grades. Weight differentially if desired. 
  Standardized Accountability Measure: 
    SM: Core dimensions of proficiency aligned with standards. 
    TM: Schemas for tasks to evince correctness of procedure or knowledge. 
    EM: Evaluation procedures for task types (e.g., right/wrong, partial 
      credit, scoring rubrics). 
    MM: Models that maximize precision of latent variable estimate and 
      efficiency of delivery. 
  Tutoring System: 
    SM: Specification of target proficiencies: production rules or aggregates 
      of them needed to guide students’ activity. 
    TM: Specification of features of tasks appropriate for different aspects 
      of the learning progression. 
    EM: Procedures to evaluate features of performances and/or products. 
    MM: Fine-grained model such as Bayes net or cognitive diagnosis model. 

ECD Layer: Assessment Implementation 
  Epistemic Focus: Based on existing knowledge, data, and specifications from 
    above, create the elements needed for the assessment. 
  Classroom Instruction: Create items for quizzes and tests. Establish grade 
    book. Adjust tests to match changes in curricula. 
  Standardized Accountability Measure: Author specific tasks; develop scoring 
    keys or rubrics; calibrate IRT model; assemble forms to optimize Test 
    Information Function (can cycle with field tests and calibration samples). 
  Tutoring System: Build rules into interactive system, for managing 
    information for inference and instruction choices (e.g., Bayes nets, 
    updating rules based on learning theory). Observe in pilot phases and 
    modify rules and system responses. 

ECD Layer: Assessment Delivery 
  Epistemic Focus: Create or co-opt circumstances to obtain relevant evidence. 
  Classroom Instruction: Observe work on classroom assignments & behavior. 
    Update mental and grading models. Check grades against overall impression 
    or specific activity performance. Iterate between global and diagnostic 
    levels. 
  Standardized Accountability Measure: Deliver common activities (items), 
    perhaps with limited customization such as computerized adaptive testing 
    for efficiency. Paper and pencil or computer based delivery. 
  Tutoring System: Ongoing interaction of students and computer system, with 
    cycles of presentation, student activity, evaluation, feedback and 
    adaptation of learning situation. 

ECD Layer: Post Assessment Delivery 
  Epistemic Focus: Communicate inferences and implications. 
  Classroom Instruction: Report card. Ongoing verbal feedback. 
  Standardized Accountability Measure: Performance report typically in 
    relation to performance of others. 
  Tutoring System: Estimates of proficiency and reporting of particular error 
    patterns and progress. 

Notes. Within the CAF: SM = student model; TM = task model; EM = evidence 
model; MM = measurement model. 

3.1.1 Domain Analysis Layer. The first layer of the ECD framework is 
domain analysis. Domain analysis marshals beliefs, representations, and modes 
of discourse for the target domain. This can include best practices, research 
findings, practitioners’ experiences, expert-novice studies, and historical or 
sociological framings of a set of knowledge and skills. Understanding the 
epistemic frame of a domain [Shaffer 2006] helps a designer avoid confusing the 
mastery of isolated tasks with functional mastery of work in a domain. It is not 
simply “the content” of the domain that matters but how people think with that 
content, what they do, and the situations in which they do it. 

3.1.2 Domain Modeling Layer. The second layer of the ECD framework is 
domain modeling. Assessment developers organize insights about the domain 
from domain analysis into the form of assessment arguments, in representations 
that more formally reflect the structure of the Messick quote. They articulate 
structures and dependencies in knowledge, skills and attributes in the domain, 
and the relationships of these capabilities to situations and activities. Useful 
representations include standards formulations, scope and sequences in 
curriculum, concept maps [DiCerbo 2007], hierarchies of skill dependencies or 
progressions, Toulmin diagrams, and assessment design patterns. 

Specifically, Toulmin diagrams for assessment arguments map out the 
relationships among proficiencies, performances, features of work, and features 
of task situations [Mislevy 2006]. Design patterns sketch out a design space for 
task authors, with options and examples that draw on research and experience 
with a certain kind of proficiency [Liu and Haertel 2011]. For example, Mislevy 
et al. [2009] describe a suite of design patterns that help designers create tasks to 
assess model-based reasoning. The conceptualizations in domain modeling that 
ground the design of operational assessments can be continually extended and 
refined as knowledge is acquired in prototypes, field trials, and analyses of 
operational data. 

3.1.3 The Conceptual Assessment Framework Layer. The CAF lays out more 
formal specifications for the operational elements of an assessment. Designers 
combine domain information with information about goals, constraints, and 
logistics to create a blueprint for an assessment, in terms of psychometric 
models, specifications for evaluating students’ work, schemas for tasks, and, in 
technology-based assessments, specifications of the interactions that will be 
supported. The CAF thus provides structures that bridge work from the domain 
analysis and domain modeling phases and the actual objects and processes that 
will constitute the operational assessment, as described in the 
assessment delivery layer. 

The CAF comprises models (as noted above, software-engineering object 
models in the sense of Rumbaugh et al. 1991) whose objects and specifications 
provide the blueprint for tasks, evaluation procedures, and statistical models and 
delivery and operation of the assessment. The following paragraphs describe the 
central CAF models depicted in Figure 2. 


[Figure 2 panels: Student Model; Evidence Model(s); Task Model(s).] 


Fig. 2. The central models of the conceptual assessment framework. 


A task model is a set of assumptions and structures describing task and 
environment features. Key design elements include the specification of the 
cognitive artifacts and affordances needed to support the student’s activity and 
the forms in which students’ performances will be captured (i.e., work products), 
such as the sequence of steps in an investigation or the final solution of a design 
problem. The variables in task models play key roles in assessment arguments, 
task design, and psychometric models [Mislevy et al. 1999]. 

The student model contains variables for expressing claims about targeted 
aspects of students’ knowledge and skills, at a grain size and nature that suits the 
purpose of the assessment. That is, the student model consists of the variables 
and the structure of those variables in the psychometric model used to synthesize 
information about aspects of students’ proficiencies. It formalizes the aspects of 
the capabilities identified in the domain model that will be incorporated into the 
inferential logic of the operational assessment. 

The evidence model bridges the student model and the task model. It consists 
of two components: the evaluation component (i.e., evidence identification 
component) provides the rationale and specifications for how to identify and 
evaluate the salient aspects of work products, which will be expressed as values 
of observable variables. Data that will be generated in the evaluation component 
are synthesized across tasks in the measurement model component (i.e., evidence 
accumulation component). The simplest measurement models contain summed 
scores of salient features of a performance such as the number or percentage 
correct score. More complicated measurement models such as models from item 
response theory (IRT) [e.g., de Ayala 2009; Hambleton and Swaminathan 1985; 
Lord 1980; Reckase 2009], diagnostic classification models (DCMs) [e.g., Rupp 
et al. 2010], and Bayesian networks (BNs) [e.g., Levy and Mislevy 2004] include 
formal latent variables. For example, BNs extend the concept maps used in 
domain analysis to support probabilistic inference [Jensen 1996], such as 
modeling student skill levels on learning progressions [West et al. 2010]. 
Although the statistics behind BNs can be complex, the graphical displays that 
represent the statistics are more accessible to task designers, teachers, and 
students [DiCerbo 2009]. 
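
The contrast between the simplest measurement models and those with formal latent variables can be sketched in a few lines. The following is a minimal illustration only, assuming a hypothetical single binary skill variable, conditionally independent observables, and made-up conditional probabilities; it is not a model used in the work cited above.

```python
def number_correct(observables):
    """Simplest measurement model: the sum of scored observables (1 = correct)."""
    return sum(observables)


def posterior_mastery(observables, prior=0.5,
                      p_correct_master=0.8, p_correct_nonmaster=0.3):
    """Latent-variable alternative: Bayes update for one binary skill.

    Assumes (hypothetically) that observables are conditionally independent
    given the skill, as in a two-layer Bayesian network. All probabilities
    are illustrative values, not estimates from data.
    """
    weight_master = prior
    weight_nonmaster = 1.0 - prior
    for x in observables:
        weight_master *= p_correct_master if x else 1.0 - p_correct_master
        weight_nonmaster *= p_correct_nonmaster if x else 1.0 - p_correct_nonmaster
    return weight_master / (weight_master + weight_nonmaster)


responses = [1, 1, 0, 1]  # values of observable variables from four tasks
print(number_correct(responses))
print(round(posterior_mastery(responses), 3))
```

Both functions take the same observables as input; they differ in whether the result is a score or a statement of belief about a student model variable.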

It is in the CAF, and specifically in the student model and the measurement 
model component of the evidence model, that the two previously mentioned 
insights of psychometrics in the standard assessment paradigm are implemented: 
characterizing the weight of evidence in a formal probability model, and enabling 
evidence from different tasks to be synthesized in terms of evidence about the 
same latent variables for student proficiencies. These ideas are so central to the 
interplay between psychometrics and EDM that we will devote a section to them 
after the survey of the ECD framework. 

As other articles in this special issue demonstrate, simulation and game-based 
assessments are a developing frontier of assessment [see also Rupp et al. 2010; 
Shute 2011]. This is especially the case for the evaluation and measurement 
components of the evidence model. As we noted above, most of the existing 
practices and the language of measurement evolved for tests consisting of 
discrete, pre-packaged tasks with just a few bits of data. Measurement 
researchers are extending the evidentiary reasoning principles that underlie 
familiar test theory for this kind of data to the new environment of the “digital 
ocean” of data [DiCerbo and Behrens 2012; Junker 2011]. It is these rich, 
complex, and interactive contexts in which EDM will be most valuable. 

3.1.4 Assessment Implementation Layer. The fourth layer of the ECD 
framework is the assessment implementation layer. In this layer, assessment 
practitioners create functioning realizations of the models articulated in the CAF. 
Field test data are used to check model fit and to estimate parameters of the 
operational system. The data structures of tasks and parameters are in the forms 
specified in the CAF models. Some tasks may be omitted from subsequent 
consideration because of unexpected interactions with student characteristics, 
misinterpretation, or other functional issues. 

Assessment implementation interacts with other ECD layers at this point, in 
two directions: moving down the layers, toward operation, the data from field 
tests are used to tune and parameterize tasks and scoring algorithms. Moving 
back up the layers, unanticipated results and new discoveries can lead to 
improvements in the CAF models, further back up to new forms for the elements 
of assessment arguments, or even further back to fundamental advances in 
understanding of the domain. The logic of these iterations is similar in many 
respects to the iterative logic of exploratory data analysis [Tukey 1977; Behrens 
1997; Behrens et al. in press]. As in EDM and exploratory statistical analysis, 
there is a stance of skepticism, and “multiple-working hypotheses” are iteratively 
reduced through detailed data display, model fit analysis and sensitivity analysis. 

3.1.5 Assessment Delivery Layer. The fifth layer of the ECD framework is the 
assessment delivery layer. In this layer, students interact with tasks, their 
performances are evaluated, and feedback and reports are produced. Almond et 
al. [2002] lay out a four-process delivery architecture (or four-process model) that 
can be used to describe delivery processes and associated infrastructure 
components for assessments that range from computer-based testing procedures 
and paper-and-pencil tests to informal classroom tests, tutoring systems, and 
one-to-one tutoring interactions. The processes are thus defined in terms of activities and 
information, and could be carried out by computers or humans or some 
combination, and the architecture is indifferent to the implementation. Behrens 
et al. [2008] show how the logic and the structure of this architecture can be 
extended to games. 

Figure 3 shows the principal processes and their interconnections. Next to 
each process are additional symbols indicating relevant data types available in 
complex systems. The activity selection process creates an appropriate task or 
activity, or selects one in light of what is known about the student. The 
presentation process interacts with the student and captures work products. The 
evidence identification process is variously called response processing, feature 
identification, or task-level scoring. This process evaluates work products by 
methods specified in the evaluation component of the evidence model. This 
process sends values of observable variables to the evidence accumulation 
process, or test-level scoring, which uses the measurement models to summarize 
evidence about the student model variables and produce score reports. 

Fig. 3. High-level view of the four-process delivery architecture. 

When an assessment is operating, the processes pass messages in a pattern 
determined by the test’s usage - different patterns of scoring, student interaction, 
and reporting are employed for formative tests, interactive simulations, and 
batched tests for state-level surveys, for example. The messages are data objects 
(e.g., parameters, stimulus materials) or are produced by the student or other 
processes in data structures (e.g., work products, values of observable variables) 
specified in the CAF [see Almond et al. 2002, for details on the relationships, and 
Almond et al. 2001, for simple worked-through examples]. 

Both evidence identification and evidence accumulation are fertile grounds 
for EDM. For evidence identification, the challenge is finding, combining, and 
characterizing salient bits of information as features of often-complex work 
products. This activity does not need to be limited to simple matching but can 
come from a broad range of symbolic or statistical computations [Williamson et 
al. 2006]. Automated scoring of spoken responses, for example, can consist of 
multiple stages, from acoustic analysis, to extraction of features using natural 
language processing, to statistical combinations of features to produce scores for 
various aspects of the performance [Bejar 2010]. For evidence accumulation, the 
challenge is determining useful ways of combining, interpreting, and drawing 
inferences from these features. Discoveries feed back as improvements to the 
evidence model in the CAF. Insights into what is important to observe give us 
better ideas on how to evoke evidence and produce work products, which feed 
back to the CAF as improved task models. 

Distinguishing the processes of a delivery system brings to light conceptually 
distinct activities in assessment that are obscured in multiple choice testing. 
Standard practice tightly binds the presentation format (multiple-choice items), 
work products (mark an option), evidence identification (matching the marked 
option with the key), and evidence accumulation (count the number of correct 
responses). The articulated architecture emphasizes that the purpose of the 
presentation activity is to elicit a work product that could be a simple choice, but 
could be a complex result such as an essay, activity log, or complex outcome 
found in the business world (proposal, spreadsheet, or video). 
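
The separation of the four processes in a multiple-choice test can be made explicit with a small sketch. This is an illustration under stated assumptions, not the architecture of Almond et al. [2002] itself: the item names, answer key, and function names are all hypothetical.

```python
ANSWER_KEY = {"item1": "B", "item2": "D", "item3": "A"}  # hypothetical key


def select_activities(key):
    """Activity selection: a fixed form presents every item in a fixed order."""
    return list(key)


def present(items, marked):
    """Presentation: capture one work product (the marked option) per item."""
    return {item: marked.get(item) for item in items}


def identify_evidence(work_products, key):
    """Evidence identification: match each marked option against the key."""
    return {item: int(choice == key[item])
            for item, choice in work_products.items()}


def accumulate_evidence(observables):
    """Evidence accumulation: count the number of correct responses."""
    return sum(observables.values())


marked = {"item1": "B", "item2": "C", "item3": "A"}  # one student's marks
items = select_activities(ANSWER_KEY)
work_products = present(items, marked)
observables = identify_evidence(work_products, ANSWER_KEY)
score = accumulate_evidence(observables)
print(observables, score)
```

Because each process only consumes the data structure produced by the one before it, any single function could be replaced (for example, by a rubric-based evaluation of an essay) without changing the others.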

The interaction among the four processes for a fixed-form paper-and-pencil 
multiple-choice test is a single trip around the cycle if both the content and 
ordering of tasks are fixed. In a computer-adaptive test that maximizes information 
about each student individually, the cycle around all four processes occurs for 
each item: an item is presented and the work product, namely the response, is 
obtained. Evidence identification evaluates its correctness and passes the result to 
evidence accumulation. Evidence accumulation updates belief about the student’s 
proficiency, using for example an IRT model. This information is passed to 
activity selection, which selects the next item to be most informative, in light of 
responses up to this point [Wainer et al. 2000; van der Linden and Glas 2010]. 
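
The per-item cycle of a computer-adaptive test can be sketched as follows. This is a minimal illustration, assuming a Rasch (one-parameter logistic) IRT model, a discrete grid approximation to the belief about proficiency, and a hypothetical three-item bank; operational systems of the kind described in the cited sources are considerably more elaborate.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # proficiency grid from -4 to 4


def p_correct(theta, b):
    """Rasch model: probability of a correct response at proficiency theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at theta, which equals p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def posterior_mean(belief):
    return sum(w * t for w, t in zip(belief, GRID))


def update_belief(belief, b, correct):
    """Evidence accumulation: Bayes update of the belief over the grid."""
    post = [w * (p_correct(t, b) if correct else 1.0 - p_correct(t, b))
            for w, t in zip(belief, GRID)]
    total = sum(post)
    return [w / total for w in post]


def select_next(belief, remaining):
    """Activity selection: the item most informative at the current estimate."""
    theta_hat = posterior_mean(belief)
    return max(remaining, key=lambda item: item_information(theta_hat, remaining[item]))


def run_cat(item_bank, answer_fn, n_items):
    """One trip around all four processes per item."""
    prior = [math.exp(-0.5 * t * t) for t in GRID]  # standard normal shape
    total = sum(prior)
    belief = [w / total for w in prior]
    remaining = dict(item_bank)
    administered = []
    for _ in range(min(n_items, len(remaining))):
        item = select_next(belief, remaining)
        difficulty = remaining.pop(item)
        correct = answer_fn(item)  # presentation plus evidence identification
        belief = update_belief(belief, difficulty, correct)
        administered.append(item)
    return administered, posterior_mean(belief)


bank = {"easy": -1.5, "medium": 0.0, "hard": 1.5}  # hypothetical difficulties
order, estimate = run_cat(bank, lambda item: True, n_items=3)
print(order, round(estimate, 2))
```

Here `answer_fn` stands in for the student: it collapses presentation and evidence identification into a single call that returns whether the response was correct.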

A simulation-based task can require many interactions among the processes. 
Frezzo et al. [2009] describe the interplay among the four processes in the 
context of the CNA's simulation-based Packet Tracer Skills-Based Assessment 
[see also Rupp et al., this issue]. Activity selection is currently done largely 
outside the simulation system, although in the articulated architecture this can be 
changed with minimal changes to the other processes. In the presentation 
process, the simulation and visualization affordances of the Packet Tracer tool 
allow for presenting tasks that include a broad range of networking devices and 
protocols. A variable manager allows for the random, or otherwise algorithmic, 
generation of specific values of features in the environment from lists or numeric 
ranges. Next, in order to evaluate the work products, Packet Tracer provides task 
authors with a comprehensive list of network states, lets them select which low- 
level work product features to use in scoring, then allows them to craft scoring 
rules to apply to these features to create observables. 
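
The flow from generated task values, to low-level work product features, to observables can be sketched as below. This is purely illustrative: the feature names, the value lists, and the rule format are hypothetical and do not reflect Packet Tracer's actual authoring interface or data structures.

```python
import random


def generate_task_values(rng):
    """Variable-manager sketch: draw task features from a list and a range."""
    return {
        "hostname": rng.choice(["R1", "HQ", "Branch"]),
        "subnet": rng.randint(1, 254),
    }


def scoring_rules(task):
    """Author-selected features and the rules that turn them into observables."""
    expected_ip = "192.168.{}.1".format(task["subnet"])
    return {
        "hostname_set": lambda s: s.get("router.hostname") == task["hostname"],
        "interface_configured": lambda s: (
            s.get("router.g0/0.ip") == expected_ip
            and s.get("router.g0/0.up") is True
        ),
        "connectivity": lambda s: s.get("pc1.ping.pc2") is True,
    }


def evaluate(state, rules):
    """Evidence identification: apply each rule to the final network state."""
    return {name: bool(rule(state)) for name, rule in rules.items()}


task = generate_task_values(random.Random(0))

# Hypothetical final network state for a student who configured the router
# correctly but did not achieve end-to-end connectivity.
state = {
    "router.hostname": task["hostname"],
    "router.g0/0.ip": "192.168.{}.1".format(task["subnet"]),
    "router.g0/0.up": True,
    "pc1.ping.pc2": False,
}

observables = evaluate(state, scoring_rules(task))
print(observables)
```

Keeping the rules as data separate from the evaluation function mirrors the idea that task authors choose features and craft rules without altering the scoring machinery.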

Note that observable variables need not be answers to discrete, pre-packaged 
questions, but rather identification of salient features in recurring situations 
within a continuous flow of performance. For example, using an ECD 
framework, DiCerbo and Behrens [2012] argue that as daily activity becomes 
increasingly digital the separation of activity for assessment or non-assessment 
purposes can be reduced since digital environments are often naturally 
instrumented to collect work products unobtrusively. When considering digital 
work products, the assessment designer is faced with the challenge of encoding 
features of the work product into observations that can be synthesized in the 
evidence accumulation process. This is a leverage point for EDM because it 
concerns pattern recognition and dimension reduction. 

Consider, for example, Theodoridis and Koutroumbas’s [1999] generic pattern 
recognition process shown as Figure 4. 





Fig. 4. (a) Pattern recognition process as described in Theodoridis and Koutroumbas [1999]. (b) Corollary 
to scoring and inference process as described in the ECD literature. 


Its steps can be related to the delivery processes of the ECD framework. Output 
from the sensor in this model is equivalent to the work product produced by the 
presentation process. Feature generation is concerned with the creation of 
variables that can be used to describe aspects of the data, corresponding to 
observable variables in ECD. Feature selection is concerned with determining 
which observable values are useful for the evidence accumulation process (or 
perhaps task level or diagnostic feedback as well). Classifier design corresponds 
to the measurement model / psychometrics for classifying students or measuring 
proficiency. 
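
The correspondence can be made concrete with a toy pipeline. This is a sketch under stated assumptions: the log format, the feature names, and the threshold classifier are invented for illustration, and in practice a psychometric measurement model would take the place of the final step.

```python
def generate_features(log):
    """Feature generation: describe a raw work product (evidence identification)."""
    commands = [e for e in log if e["type"] == "command"]
    return {
        "n_commands": len(commands),
        "n_errors": sum(1 for e in commands if e["error"]),
        "used_help": int(any(e["text"] == "?" for e in commands)),
        "seconds": log[-1]["t"] - log[0]["t"],
    }

def select_features(features, keep):
    """Feature selection: retain only the observables used downstream."""
    return {name: features[name] for name in keep}

def classify(observables, max_error_rate=0.25):
    """Classifier design: a deliberately simple stand-in for a measurement model."""
    rate = observables["n_errors"] / max(observables["n_commands"], 1)
    return "proficient" if rate <= max_error_rate else "needs practice"

# "Sensor" output, i.e. the work product: a made-up command log.
log = [
    {"t": 0, "type": "command", "text": "enable", "error": False},
    {"t": 5, "type": "command", "text": "?", "error": False},
    {"t": 9, "type": "command", "text": "conf t", "error": False},
    {"t": 20, "type": "command", "text": "int fa0/1", "error": True},
]
features = generate_features(log)
observables = select_features(features, keep=["n_commands", "n_errors"])
label = classify(observables)
```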

ECD provides a flexible and abstracted understanding of assessment data and 
its relationship to the evidentiary assessment argument, and advances in 
technology provide new opportunities to link work products to inferences about 
human states through evidence identification and accumulation. However, in 
many cases there is weak or little theory that can relate variations in complex 
work product data (e.g., logs, videos, transcripts of group interactions) and 
inferences regarding student states, so methods for pattern extraction and other 
EDM approaches will be needed. 

4. PSYCHOMETRICS AND DATA MINING 

The psychometric paradigm has been the dominant analysis framework for 
educational assessment for more than a century, although mainly with the sparse 
data (at the level of each student) that characterized the standard assessment 
paradigm. The new field of EDM seeks to improve ways of learning and 
assessing that are beyond the reach of established analytic practice. This section 
brings out some essential similarities and differences in the approaches. 

4.1 The Ontology and Epistemology of Psychometrics 

The view that underlies ECD is that assessment is not simply about producing 
scores but about obtaining evidence about aspects of students’ proficiencies, and 
characterizing the meaning and the value of that evidence [Mislevy 1994]. 
Psychometricians use particular kinds of statistical models to quantify these 
arguments, based on the relevant forms and patterns in data. The two key insights 
mentioned earlier are (a) characterizing the value of evidence about students’ 
proficiencies in a probabilistic framework and (b) using latent variable models to 
synthesize evidence from different collections of tasks in a common 
interpretative frame. 

The central insight in characterizing evidence is this: there is a difference 
between what we observe and what we really want to make inferences about, and 
the features of the observational situation impact the quality of our inferences. 
Classical test theory (CTT) [Gulliksen 1961] uses standard errors of 
measurement and reliability indices to characterize the accuracy of students’ test 
scores. CTT machinery enables researchers to design tests and compare 
alternative scoring approaches to improve their work [Cronbach et al. 1972]. 
The meaning of test scores remains closely bound to the particular items that 
make up a given test, however. Except in special cases, there is no 
straightforward way to relate performance on one set of test items to another. 
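
As a minimal sketch of the CTT quantities mentioned here, the following computes one common reliability index (coefficient alpha) and the corresponding standard error of measurement, SD times the square root of one minus the reliability. The score matrix is made up, and coefficient alpha is chosen for illustration rather than taken from the sources cited above.

```python
from math import sqrt
from statistics import stdev, variance

def cronbach_alpha(scores):
    """Coefficient alpha; scores is a list of per-student lists of item scores."""
    k = len(scores[0])
    item_variances = [variance([s[i] for s in scores]) for i in range(k)]
    total_variance = variance([sum(s) for s in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def standard_error_of_measurement(scores, reliability):
    """SEM = SD of total scores * sqrt(1 - reliability)."""
    return stdev([sum(s) for s in scores]) * sqrt(1 - reliability)

# Four students, three dichotomously scored items (illustrative data).
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(scores)
sem = standard_error_of_measurement(scores, alpha)
```

Both quantities are computed from total scores on this particular set of items, which is why their meaning stays bound to the test form.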

Latent variable psychometric models incorporate an additional insight: we can 
model the capabilities of people and the features of tasks in ways that enable us 
to draw inferences about people from different observational situations. Latent 
variable models posit probability distributions for patterns in observed variables 
as functions of unobservable (i.e., latent) variables that characterize students’ 
knowledge, skills, strategy repertoires, misconceptions, degree of automaticity, 
or other cognitively relevant aspects of their capabilities. 
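
To illustrate, the sketch below uses the Rasch model, one of the simplest IRT models, in which the probability of a correct response depends only on the difference between a student's proficiency and the item's difficulty. The item difficulties, the uniform prior, and the grid approximation are assumptions made for the example.

```python
from math import exp

def p_correct(theta, b):
    """Rasch model: P(correct | proficiency theta, item difficulty b)."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def posterior_mean(responses, grid=None):
    """Posterior mean of theta under a uniform prior on a grid.

    responses is a list of (difficulty, score) pairs; items are treated as
    conditionally independent given theta, so the likelihood is a product.
    """
    grid = grid or [g / 10 for g in range(-40, 41)]
    weights = []
    for theta in grid:
        likelihood = 1.0
        for b, x in responses:
            p = p_correct(theta, b)
            likelihood *= p if x == 1 else 1.0 - p
        weights.append(likelihood)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

# Two students with the same raw score (2 of 3) on different sets of items.
easy_form = [(-1.5, 1), (-1.0, 1), (-0.5, 0)]
hard_form = [(0.5, 1), (1.0, 1), (1.5, 0)]
theta_easy = posterior_mean(easy_form)
theta_hard = posterior_mean(hard_form)
```

The two raw scores are identical, yet the estimates differ because the model accounts for item difficulty. Both estimates sit on the same proficiency scale even though the students saw no items in common.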

The variables in such a model are specified in the student model in the CAF 
of the ECD framework. They are persistent in the sense that they are posited to 
influence performance across some domain of tasks where the set of 
proficiencies they characterize is relevant, and performance in any of the tasks 
provides evidence about these student model variables. Exactly how performance 
in each task depends on the student model variables is specified in the 
measurement models of the CAF. This structure allows a stable frame of 
interpretation across task situations that may differ markedly on the surface. It 
becomes possible to assemble psychometric models for different situations 
according to the features of the situations [Rupp 2002]. 

As with CTT, a latent-variable modeling framework provides a quantitative 
basis for operational matters such as planning test configurations, calculating the 
accuracy and reliability of measurement, figuring out how many tasks or raters 
we need to be sufficiently sure about the appropriateness of decisions based on 
test scores, or monitoring the quality of large-scale assessment systems. These 
models can also be applied to new kinds of testing processes, such as simulation- 
based tasks and game-based assessments [Mayrath et al. 2012]. Rupp et al. [this 
issue] employ DCMs and BNs for CNA's simulation-based Packet Tracer Skills- 
Based Assessment. 

Although both psychometric models and many EDM models are statistical 
models, there are distinguishing characteristics of psychometric models and the 
way they are used. First is the psychological construal of the latent variables in 
psychometric models. Their interpretation may be cast in terms of behavioral, 
trait, information-processing, or sociocultural psychology [Mislevy 2006], but in 
all cases they reflect some view of aspects of students’ capabilities. The 
information in patterns of data is synthesized as evidence to characterize students 
in terms of their standing on these latent variables in the student model. 

Second, the nature and grainsize of these student model variables is shaped by 
the purpose of the assessment. Is the purpose to provide broad feedback about 
students’ general level of proficiency? A psychometric model with few variables, 
perhaps even an IRT model with a single latent variable for overall proficiency in 
the domain [Lord 1980] and cast in trait theory, might suffice. Is the purpose of 
the assessment to provide diagnostic information to guide instruction? A more 
detailed DCM [Rupp et al. 2010] cast in an information-processing cognitive 
perspective will be better suited. Is the purpose to characterize knowledge and 
strategy use in interactive problem solutions? A modular BN approach with 
models assembled on the fly [Shute et al. 2009] that draws on a situative 
psychological perspective can be pressed into service. The aim is not simply to 
discover and model patterns in data; it is to model those patterns that are relevant 
to specific, practical, educational purposes, in terms that directly inform those 
purposes. 

Third is the explicit mathematical separation of observed score variables from 
latent student model variables. As noted above, the probability distributions of 
score variables are modeled as a function of student model variables. More 
important technically is that the observed variables - which may be 
characteristics of patterns across lower-level data features - from a given task 
situation are modeled as conditionally independent of data variables from other 
task situations. When such a model fits data from some domain of tasks 
satisfactorily, the underlying patterns in performance that are manifest in 
different raw data in different task situations can be modeled in terms of the same 
variables in a student model that can be used with different tasks for different 
students or at different time points. 
Computer-adaptive tests use this idea with IRT, so that students get different 
items, harder or easier, based on how well they are doing [van der Linden and 
Glas 2010; Wainer et al. 2000]. Cognitive diagnostic tests use different tasks 
based on the same cognitive features for teaching and testing mathematics skills 
[Leighton and Gierl 2007]. Iseli et al. [2010], Shute et al. [2009], and VanLehn 
[2008] use the idea to build BNs on the fly in game-based and simulation-based 
assessments to harvest evidence about students’ skills from the unique situations, 
as agents recognize cognitively-relevant features of the situations. 
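
The adaptive selection step can be sketched as follows. Under the Rasch model, an item's Fisher information at proficiency theta is p(1 - p), which peaks when the item's difficulty matches theta, so a simple adaptive rule administers the unused item with the most information at the current estimate. The item bank and the selection rule are illustrative assumptions; operational adaptive tests add content balancing, exposure control, and stopping rules.

```python
from math import exp

def p_correct(theta, b):
    """Rasch model: P(correct | proficiency theta, item difficulty b)."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at proficiency theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta, item_bank, administered):
    """Pick the not-yet-administered item that is most informative at theta."""
    candidates = {name: b for name, b in item_bank.items() if name not in administered}
    return max(candidates, key=lambda name: item_information(theta, candidates[name]))

# Hypothetical item bank: item name -> difficulty.
bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}
first_for_strong_student = next_item(1.6, bank, administered=set())
next_after_medium = next_item(-0.3, bank, administered={"medium"})
```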

In sum, we note that the patterns in data transcend the particulars in which 
they were gathered, in ways that we can talk about in terms of students’ 
capabilities, which we implement as student model variables and organize in 
ways tuned to their purpose. Having the latent variables in the student model as 
the organizing framework allows us to carry out coherent interpretations of 
evidence from a task with one set of surface features to other tasks that may be 
quite different on the surface. The machinery of probability-based inference in 
the evidence accumulation process is used to synthesize information from diverse 
tasks in the form of evidence about student capabilities, and quantifies the 
strength of that evidence. Psychometric models can do these things to the extent 
that the different situations display the pervasive patterns at a more fundamental 
level, because they reflect fundamental aspects of the ways students think, learn, 
and interact with the world. 

4.2 Is Educational Data Mining Psychometrics? 

The preceding section described key ideas of the latent variable models that 
represent advanced application of psychometrics in educational assessment, 
primarily under the standard assessment paradigm. How does EDM relate to 
these ideas? 

We stated earlier that there is a broad range of analytic needs in assessment, 
ranging from support of domain analysis of text, to feature extraction from 
complex logs, to methods for inferring connections among assessment activities 
[Behrens et al. 2012]. Many of these activities are not addressed by traditional 
psychometric models. Nor are they meant to be; the majority of psychometric 
activity under the standard assessment paradigm has focused on the evidence 
accumulation process, assuming other processes are sufficiently well prescribed 
by the multiple-choice and ordered score categories of the standard assessment 
paradigm. 

Using EDM techniques to detect relevant patterns in lower-level raw data is 
not psychometrics in terms of the key ideas outlined above, but it is 
undertaking foundational work for broad psychometric modeling; this is feature 
generation and feature selection in Figure 4. Such patterns that are grist for 
defining the observed score variables (i.e., evidence identification, in terms of the 
four-process delivery system architecture) serve as input to psychometric models. 
EDM techniques are ways for discovering or iteratively refining data variables 
from complex performances. 

As examples, Kerr and Chung [this issue] conducted exploratory cluster 
analyses to identify salient features of student performance in an educational 
video game targeting rational number addition, and Hershkovitz and Nachmias 
[2010] used learnograms to identify variables indicative of student motivation 
from logs of student activities in an online learning system for Hebrew 
vocabulary. What is missing from this EDM work from the perspective of 
psychometrics, though, is the dependence of these variables on the latent 
variables. 

Obtaining summary measures of aspects of students’ performance in a 
particular complex task and taking the scores at face value does not incorporate 
the probabilistic contribution of psychometrics. There may indeed be valuable 
information about the performance and about the student, but there is no 
characterization of the evidentiary value of the evidence or of its meaning outside 
the framework of the particular task. 

We can, however, incorporate the key psychometric idea of quantifying the 
value of evidence by using replicate tasks or internal measures of variability such 
as jackknife standard errors [Mosteller and Tukey 1977]. For example, Beck 
[2005] introduced engagement tracing based on response times, and presented 
reliability evidence for the use of response times to multiple-choice cloze 
questions based on split-half methods. The machinery is in place to experiment 
with alternative scoring methods or data capturing procedures to improve the 
value of the evidence from these particular task situations. 
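
Both ideas can be sketched with the standard library alone. The split below is an odd-even split with the Spearman-Brown correction, and the jackknife recomputes a statistic with each observation left out in turn. The data are made up, and neither function is claimed to reproduce the specific procedures of the studies cited above.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(scores):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    odd = [sum(s[0::2]) for s in scores]
    even = [sum(s[1::2]) for s in scores]
    r = pearson(odd, even)
    return 2 * r / (1 + r)

def jackknife_se(values, statistic):
    """Jackknife standard error of statistic(values)."""
    n = len(values)
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return sqrt((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out))

# Illustrative data: four students by four task-level measures.
scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
reliability = split_half_reliability(scores)
se_of_mean = jackknife_se([2.0, 4.0, 4.0, 6.0], mean)
```

For the sample mean, the jackknife standard error equals the familiar s divided by the square root of n, which gives a quick check on the implementation. The same function applies to summary measures that have no closed-form standard error.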

An EDM model that includes latent variables which are posited to account for 
observable score variables through conditional probability distributions, yet 
remains bound to a particular task or set of tasks, is very close to the spirit of 
latent variable psychometrics. The final step is whether the same student model 
comprising these latent variables can be used to model performance on different 
tasks in the domain - even ones that appear idiosyncratically in games or 
simulations, when recognizable by virtue of their salient features as instances of 
classes of recurring situations. 
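
A minimal sketch of this structure is a single discrete latent variable with conditional probability tables for each observable, which amounts to a one-node Bayesian network or latent class model. The states, task names, and probabilities below are invented for illustration.

```python
def posterior(prior, cpts, evidence):
    """Posterior over latent states given observed scores.

    prior:    {state: P(state)}
    cpts:     {observable: {state: P(observable = 1 | state)}}
    evidence: {observable: 0 or 1}, for whichever tasks were actually observed
    """
    joint = {}
    for state, p in prior.items():
        for obs, x in evidence.items():
            p1 = cpts[obs][state]
            p *= p1 if x == 1 else 1.0 - p1
        joint[state] = p
    total = sum(joint.values())
    return {state: value / total for state, value in joint.items()}

prior = {"master": 0.5, "nonmaster": 0.5}
cpts = {
    "task_a": {"master": 0.8, "nonmaster": 0.2},
    "task_b": {"master": 0.9, "nonmaster": 0.3},
}
after_one_task = posterior(prior, cpts, {"task_a": 1})
after_two_tasks = posterior(prior, cpts, {"task_a": 1, "task_b": 1})
```

The same latent variable is updated whichever tasks happen to be observed. Whether that reuse is warranted across different tasks is the empirical and design question raised above.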

For example, Arroyo et al. [2010] used BNs with latent variables to model 
unknown student attitudes and goals (e.g., fear of being wrong, wanting a 
challenge) in a web-based tutoring system for high school mathematics. The 
extent to which the interpretations of latent variables representing the student 
attitudes and goals are restricted to the particular tasks in the system or are 
generalizable to other high school mathematics tasks - or their attitudes with 
respect to other academic domains - is unclear. Absent empirical studies, the 
argument for generalizability of the interpretations of the latent variables rests on 
one of design, drawing strength from a principled approach to design and the 
coherence among the domain analysis, domain modeling, CAF, assessment 
implementation, and assessment delivery layers of the assessment design and 
implementation process. 

4.3 Data Mining as a Reaction to Perceived Limitations of Psychometrics 

The evidentiary reasoning insights of psychometrics are quite powerful for 
familiar kinds of assessments. A century of experience and research and an 
armamentarium of models and techniques exist for modeling data that consist of 
item scores and judges’ ratings of performances. Far less guidance is available 
for modeling the kinds of work that can now be routinely captured in digital 
environments: every key stroke and time stamp in the log of an open-ended 
troubleshooting task in a simulation environment, for example, or the real-time 
interactions of hundreds of players in an online game, or continuous physical 
monitoring of students as well as their actions, or step-by-step task solutions 
from thousands of students on hundreds of problems in intelligent tutoring 
environments as their proficiencies grow over the course of study. 

With a few exceptions [see, e.g., Ramsay 1982, on the psychometrics of 
functions as data] there simply are not many tools on the traditional psychometric 
shelf to make sense of the complex forms of data that are becoming quite routine. 
On the whole, however, the historically coarse-grained and sparse nature of 
assessment data has led to a greater focus not only on these kinds of data but 
also on higher-level psychological constructs. This stands in contrast to more 
detailed, richer, and interactive data and finer-grained modeling of students’ 
processes and strategies. It is thus natural to adapt machinery from other fields 
that deal with masses of data, such as physics, biology, meteorology, intelligence 
analysis, and computational linguistics, and bring it to bear on educational 
problems. We would argue that 
we can improve assessment practice by integrating concepts and machinery from 
the psychometric tradition and the EDM tradition, integrated within the broad 
assessment perspective reflected in ECD. 

4.4 Leverage Points for Educational Data Mining 

There are three particular leverage points for EDM with respect to 
psychometric modeling. They concern (1) the modeling of student proficiencies 
(i.e., the latent variable characterization of aspects of students’ capabilities), (2) 
understanding salient patterns in raw feature data required for evidence 
identification, and (3) understanding relationships between features of evolving 
situations and students’ proficiency-driven actions within those situations. We 
can categorize them by the three main models in the CAF. 

4.4.1 Student Models. These are the latent variable models, the semantics of 
which refer to aspects of students’ capabilities as they might apply across 
different situations, in terms that can be applied to model probabilities across 
different situations. The constituent statistical variables need not look at all like 
familiar test scores. 

Gitomer and Yamamoto’s [1991] model for understanding logic gate 
problems, for example, was a DCM with student model variables for 
understanding basic operations and common misconceptions that students might 
hold. The model could be used both for isolated problems and for reasoning 
within larger simulation tasks. Mislevy and Gitomer’s [1996] model for 
hydraulics troubleshooting was a BN, with variables for troubleshooting 
strategies and knowledge of subsystems. 

The design objective is to discover, develop, and refine student models that 
are at once consistent with the data, substantively meaningful, and practically 
useful for the job at hand. What should the variables be? How many, what is their 
nature, are there relationships among them such as prerequisition or conjunction? 
Methods from EDM that can be brought to bear on these questions include 
self-organizing maps [Pirrone et al. 2003], association rule mining [Garcia et al. 
2010], sequential pattern analysis [Zhou et al. 2010], and process mining [Trcka 
et al. 2010]. There is a clear overlap between such EDM models and 
psychometric models including factor analysis, latent class analysis, cluster 
analysis, and BNs as they are used to address this challenge. 

4.4.2 Evidence Models. This is perhaps the focus of most interest in mining 
massive data from complex performances, especially in interactive digital 
environments. There is not a lot of experience in psychometrics for this kind of 
data, and it is exactly such data that many EDM techniques have been designed 
to explore. It is easy to amass rich and voluminous bodies of low-level data: 
mouse clicks, cursor moves, sense-pad movements, and so on, as well as choices 
and actions in simulated environments. Each of these bits of data, however, is bound 
to the conditions under which it was produced, and does not by itself convey its 
meaning in any larger sense. We seek relevance to knowledge, skill, strategy, 
reaction to a situation, or some other situatively and psychologically relevant 
understanding of the action. We want to be able to identify data patterns that 
recur across unique situations, as they arise from patterns of thinking or acting 
that students assemble to act in situations. It is this level of patterns of thinking 
and acting we want to address in instruction and evaluation, and therefore want to 
express in terms of student model variables. 

The following examples illustrate techniques that assessment researchers have 
used along with domain theory to discover data patterns that evidence 
psychological patterns: 

1. In troubleshooting, using logic rules to identify action sequences in 
space-splitting situations as consistent with space-splitting, serial elimination, 
remove-and-replace, redundant, and irrelevant [Mislevy and Gitomer, 1996]. 

2. In evaluating speaking skills in language testing, using supervised neural 
networks to identify phonemes, then words, in acoustic streams [Bernstein 
1999]. 

3. In marksmanship training, using graphical analysis to identify and correlate 
patterns of breathing and trigger break timing [Chung et al. 2011]. 

4. In an epidemiology simulation, using unsupervised neural networks to 
discover patterns of systematic and haphazard sequencing of tests [Hurst et 
al. 1997]. 
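
The first example in the list can be made concrete with a toy rule set. The rules below are hypothetical and much simpler than those of Mislevy and Gitomer [1996]: the system is assumed to be a linear chain of components, a test checks the signal just after a component, and the thresholds are arbitrary.

```python
def classify_action(action, suspects, already_tested):
    """Label one troubleshooting action relative to the current problem space.

    action:         ("test", component) or ("replace", component)
    suspects:       ordered list of components that may still hold the fault
    already_tested: set of components whose test result is already known
    """
    kind, component = action
    if component not in suspects:
        return "irrelevant"
    if kind == "replace":
        return "remove-and-replace"
    if component in already_tested:
        return "redundant"
    upstream = suspects.index(component) + 1  # suspects up to and including the test point
    downstream = len(suspects) - upstream     # suspects beyond the test point
    smaller = min(upstream, downstream)
    if smaller == 0:
        return "redundant"                    # the outcome is already implied
    if 3 * smaller >= len(suspects):
        return "space-splitting"              # rules out at least a third either way
    return "serial elimination"

wide_space = list("ABCDEFGH")
narrow_space = list("ABCD")
labels = [
    classify_action(("test", "D"), wide_space, set()),
    classify_action(("test", "B"), wide_space, set()),
    classify_action(("test", "B"), narrow_space, set()),
    classify_action(("replace", "C"), wide_space, set()),
    classify_action(("test", "Z"), wide_space, set()),
    classify_action(("test", "D"), wide_space, {"D"}),
]
```

Testing after component B is labeled serial elimination when eight components are suspect but space-splitting when only four remain. The same action yields a different observable depending on the situation in which it occurs.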

We note that it is not the data patterns in and of themselves that matter in 
assessment, but how data patterns provide evidence of capabilities that are 
relevant to the purpose of the assessment. Just having gigabytes of keystrokes 
and mouseclicks is not sufficient for claiming one has good evidence for a 
particular purpose. In fact, the process of discovering and using data patterns is 
iterative, in that we capture data (based on current understanding), identify 
salient higher-level features (that we can use operationally), and continue 
mining lower level data and using our insights to improve the design of situations 
for students to act in and features of their performances to capture and interpret. 

For example, as noted earlier, Kerr and Chung [this issue] report on cluster 
analyses of attempts in an educational video game in which it was found that, in 
some cases, students successfully completed the levels of the game (i.e., solved 
tasks) using strategies other than the intended strategy. Importantly, such 
alternative solution strategies that worked in early levels (easier tasks) were 
ineffective on later levels (harder tasks). With this finding, the designers of the 
game (assessment) can reorder or redesign game levels (tasks) so that such 
strategies do not yield a solution. 

4.4.3 Task Models. Although data mining of features of situations is entwined 
with data mining of features of performance, we break it out separately here 
because it has been until recently a neglected area in psychometrics. As noted 
previously, the key point is that students’ actions make sense only in terms of the 
situations they are in. This insight was easy to slide over with standard tests, 
because tasks were fabricated by expert test developers, who knew what features 
to build into them to evoke what kinds of evidence of knowledge and skill. It was 
enough for a psychometrician to know simply that the features were there, and 
she had only to focus on performance data, say right or wrong answers. The 
problem of how situation features determine the meaning of performance features 
cannot be avoided in continuous, evolving, and digitally mediated performance 
tasks such as in games and simulations. 

These performance tasks may include fixed-form work products, such as 
interim reports and final solutions that can be modeled using modest extensions 
of familiar psychometric techniques. But the moment-by-moment situations that 
students act in, and from which the bulk of data may be obtained, arise 
idiosyncratically from students’ performances and the system’s responses to 
them. It is necessary to recognize recurring and substantively salient features of 
situations, so that salient features of performance in those situations can be 
recognized and evaluated. 

The preceding example of identifying space-splitting situations in hydraulics 
troubleshooting was of this character in that it was necessary to parse not only the 
state of the aircraft system but also the information that could be known to the 
student from his earlier actions. Similarly, examples in language testing are 
dyads of speech acts in conversations. A historical example is computer chess, 
where the challenge is to be able to characterize positions in terms of features 
such as pawn structure and phase of game. An automated evaluation scheme 
must be able to characterize strength of positions in order to compare them 
before and after sequences of possible moves. In other words, one can’t make 
sense of a move without jointly making sense of the situation [Bleicher et al. 2010]. 

4.5 Additional Leverage Points for Data Mining in Assessment 

In addition to the traditional emphasis on the analysis of work products and 
the psychometric touch points discussed above, we consider how EDM can 
provide opportunities to improve assessment practice in earlier layers of the ECD 
framework, namely the domain analysis and domain modeling layers that have 
traditionally been viewed as prior to modeling and analysis activity; recall that 
Table 2 provides a summary of the possible applications discussed above as well 
as in this section. 

4.5.1 Domain Analysis. For the educational data miner, including domain 
analysis in the assessment framework means that the increasingly available 
corpora of information in digital form, and related techniques for information 
extraction and knowledge management from extant text, can be used to improve 
assessment in new ways [Behrens et al. 2012]. They can be used to 
inform understandings of how ideas are represented and used in practice, to 
inform not only curriculum and instruction, but to continually shape assessment 
activity. Consider for example that knowledge management in the social sciences 
remains largely a manual task for researchers. New techniques of mining 
scientific publications can help inform scientists about emerging concepts or data 
in ways that can likewise help assessment developers track such changes. 

4.5.2 Domain Modeling. Here again we see a valuable and emerging role for 
the data miner. Behrens et al. [2012], for example, discuss the emerging use of 
the data-mining approaches of semantic web technologies (e.g., the Achievement 
Standards Network) to articulate component relationships between standards 
across different educational systems - typically at the country level - and content 
associated with those standards or other standards whose relationship can be 
implied via machine induction over the Resource Description Framework space. 



Journal of Educational Data Mining, Article 2, Volume 4, No 1, October 2012 



This type of hierarchy or relational analysis combined with natural language 
processing (NLP) [Manning and Schütze 1999] possibilities available for use in 
domain analysis may lead to important insights into the interconnection of 
different types of concepts, standards and assessment activities. 

In previous sections we discussed the role of EDM for improving 
psychometric aspects of the CAF (i.e., the student and evidence models) with 
implications for assessment delivery. We think that as the conceptualization of 
assessment continues to broaden to include the end-to-end operationalization of 
assessment, even more opportunities for EDM in assessment will arise. For 
example, in the area of task generation, schemes based on new approaches such 
as crowd sourcing are evolving that will require new tracking and connections of 
data. Likewise, as reporting of assessment results continues to move to on-line 
formats, the data available from the use of the reports and inferences about their 
design and communicative value may become part of the standard domain for 
educational data analysis and EDM. 

5. CONCLUSION 

ECD is a comprehensive framework for describing and understanding assessment 
activities. This paper has discussed the central components and logical features of 
ECD, highlighting the evidentiary focus in obtaining, interpreting, and explaining 
data. We discussed the role of generalization and latent variables in assessment 
thought and how psychometric models support this conceptualization. EDM is an 
emerging technology that provides important insights in several layers in the 
ECD framework if applied within an integrated assessment design and 
implementation endeavor, and can broaden the reach of computational activity to 
refine and automate assessment. 

Key to the understanding of this interplay is that though ECD stresses the 
importance of a priori clarity in the evidentiary arguments that drive design and 
development of assessment, this clarity is likewise informed by insights from a 
posteriori analysis. For example, the development of a networking skill 
performance assessment system in the CNA included the specification of scoring 
rules based on detailed expert analysis and student protocol analysis constructed 
from an ECD framework [Williamson et al. 2004]. 

However, post hoc analysis of log data using NLP algorithms revealed that 
the empirical log data was not consistent with the theoretical model implied in 
the scoring rules [DeMark and Behrens 2004]. The interplay of data and scoring 
theory is especially important in new domains where there may be an absence of 
theory concerning the observable representations of expertise to inform evidence 
model specification. In short, assessment design and development need both 
rigorous evidential logic and data analytic insight to support the overall 
evidentiary logic. 

We maintain that EDM, like all data analysis [Behrens and Smith 1996], is 
best practiced in concert with, rather than isolated from, a theoretical or 
substantive layer that involves choices or interpretations made by researchers 
based on purpose or focus of the assessment. Scholars operating within the EDM 
tradition have advocated as much in applications such as association rule mining 
[Garcia et al. 2010], sequential pattern analysis [Zhou et al. 2010], cluster 
analyses of students [Amershi and Conati 2010] and observations [Hershkovitz 
and Nachmias 2010], and latent variable modeling with BNs [Pardos et al. 2010]. 

We have no doubt that some interpretation will come from novel or surprising 
findings when data are analyzed - in this way, EDM represents a way to realize 
the illuminative goals of exploratory data analysis [Tukey 1977; Behrens 1997; 
Behrens et al. in press]. 

However, we add that assessments will be well served if, as much as possible, 
interpretative aspects are built in a priori through principled assessment design 
and a posteriori through empirical results. In sum, good EDM in assessment 
contexts is best viewed in terms of evidentiary reasoning using the lens of ECD. 
Such a perspective offers (a) a prescriptive approach for assessment design that 
builds the validity argument concurrently with the assessment, and (b) a 
framework for recognizing new findings from data analysis and a process for 
refining assessments. 





In the current era of rapid technological change and the resulting dramatic 
impact on assessment, new forms of presentation and work product data are 
being called upon for use in assessment inference. This sea-change highlights the 
limitations of fixed-response paradigms, and invites the use of EDM to inform 
the evidence identification process and feed appropriate information to the 
psychometric (or deterministic) evidence accumulation processes. New types of 
log files, interactional data streams, and other rich work products require both 
psycho-social theory and theory generation based on data analysis. 

Finally, we emphasize that EDM should not be limited to either just the 
outputs of scoring processes or just the work products as the inputs to scoring 
processes. Evolutions in the understanding and practice of assessment call for a 
broader range of concepts and methods to be applied consistently with the notion 
of a broader assessment perspective. Assessment design, development, delivery, 
and maintenance processes are complex and increasingly digital. The structure of 
data that affects assessment design, the tracking of tasks over time and of 
differential performance, variations in task attributes that may be overlooked by 
human coders, and many other artifacts of assessment are increasingly digital and 
likely to benefit from the continued evolution and application of EDM techniques. 